The purpose of this case study is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Four "Corgi" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
The attributes extracted from each silhouette are listed below:
ATTRIBUTES
COMPACTNESS: (average perim)**2/area
CIRCULARITY: (average radius)**2/area
DISTANCE CIRCULARITY: area/(av. distance from border)**2
RADIUS RATIO: (max. rad - min. rad)/av. radius
PR. AXIS ASPECT RATIO: (minor axis)/(major axis)
MAX. LENGTH ASPECT RATIO: (length perp. to max length)/(max length)
SCATTER RATIO: (inertia about minor axis)/(inertia about major axis)
ELONGATEDNESS: area/(shrink width)**2
PR. AXIS RECTANGULARITY: area/(pr. axis length * pr. axis width)
MAX. LENGTH RECTANGULARITY: area/(max. length * length perp. to this)
SCALED VARIANCE ALONG MAJOR AXIS: (2nd order moment about minor axis)/area
SCALED VARIANCE ALONG MINOR AXIS: (2nd order moment about major axis)/area
SCALED RADIUS OF GYRATION: (mavar + mivar)/area
SKEWNESS ABOUT MAJOR AXIS: (3rd order moment about major axis)/sigma_min**3
SKEWNESS ABOUT MINOR AXIS: (3rd order moment about minor axis)/sigma_maj**3
KURTOSIS ABOUT MINOR AXIS: (4th order moment about major axis)/sigma_min**4
KURTOSIS ABOUT MAJOR AXIS: (4th order moment about minor axis)/sigma_maj**4
HOLLOWS RATIO: (area of hollows)/(area of bounding polygon)
where sigma_maj**2 is the variance along the major axis and sigma_min**2 is the variance along the minor axis, and
area of hollows = area of bounding polygon - area of object
The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object, oriented at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.
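The hollows relation above (area of hollows = bounding polygon area - object area) can be sketched numerically. This is not the original caliper-based image pipeline: as a stand-in, the sketch uses a convex hull for the bounding polygon and the shoelace formula for the object's area, on a made-up notched-square shape.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Boundary points of a concave "object": a 4x4 square with a notch cut inward.
pts = np.array([
    [0, 0], [4, 0], [4, 4], [0, 4],   # outer square corners
    [2, 2],                           # notch vertex pulled inward
])

hull = ConvexHull(pts)           # bounding polygon (a convex hull here)
bounding_area = hull.volume      # in 2-D, ConvexHull.volume is the area

# Shoelace formula for the object's own polygon area (square minus notch).
obj = np.array([[0, 0], [4, 0], [4, 4], [2, 2], [0, 4]])
x, y = obj[:, 0], obj[:, 1]
object_area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

hollows_area = bounding_area - object_area
hollows_ratio = hollows_area / bounding_area
print(bounding_area, object_area, hollows_ratio)  # 16.0 12.0 0.25
```

The notch removes a triangle of area 4 from the 16-unit square, so the hollows ratio here is 4/16 = 0.25.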
NUMBER OF CLASSES
4 OPEL, SAAB, BUS, VAN
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Importing Data file
df = pd.read_csv('vehicle.csv').dropna()
df.shape
df.info()
df.isna().sum()
# 5 point summary
df.describe(include='all')
df.head()
# Since the target variable is categorical, use the value_counts function
df['class'].value_counts()
df['class'].value_counts().plot(kind='bar')
Since the dataset does not distinguish between the two cars (the Saab 9000 and the Opel Manta 400), both are labelled 'car'. As a result, the number of entries in the 'car' class is roughly double that of the 'bus' and 'van' classes.
sns.pairplot(df,hue='class')
With hue used to distinguish the classes, we see a clear separation in the distribution peaks of some attributes across the three classes, so these attributes can help classify the entries.
We can also see high correlation among many of the attributes, so reducing the dimensionality of the data set, either by dropping highly correlated columns or by using PCA, is important here.
df.corr()
From the correlation matrix it can be seen that scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance and scaled_variance.1 have pairwise correlations above 0.93, some of them above 0.99. A single one of these attributes could represent the group with negligible loss of information. Let us call this set of five attributes HACS 1 (High Attribute Correlation Set).
Apart from the columns in HACS 1, circularity, max.length_rectangularity and scaled_radius_of_gyration are correlated above 0.86 with one another. Let us call them HACS 2.
Some attributes of HACS 1 and HACS 2 also show considerable correlation with each other.
In this way we can eliminate many attributes, either through more intuitive EDA or by using PCA.
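The "drop one member of each highly correlated pair" idea can be sketched mechanically. The frame and the 0.9 threshold below are illustrative (the column names mimic the vehicle features but the data is synthetic), keeping only the upper triangle of the correlation matrix so each pair is tested once.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the vehicle features (names are illustrative).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_demo = pd.DataFrame({
    'scatter_ratio': a,
    'elongatedness': -a + rng.normal(scale=0.05, size=200),  # near-perfectly (anti-)correlated
    'compactness': rng.normal(size=200),                     # independent column
})

corr = df_demo.corr().abs()
# Keep only the upper triangle so each pair appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # columns >0.9 correlated with an earlier column
reduced = df_demo.drop(columns=to_drop)
```

Applied to the real frame, the same loop would drop all but one member of HACS 1 (and similarly for HACS 2, at a lower threshold).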
sns.boxplot(x= 'class',y = 'compactness',data = df)
df.groupby(['class']).describe()['compactness']
sns.boxplot(x= 'class',y = 'circularity',data = df)
sns.boxplot(x= 'class',y = 'distance_circularity',data = df)
sns.boxplot(x= 'class',y = 'radius_ratio',data = df)
sns.boxplot(x= 'class',y = 'pr.axis_aspect_ratio',data = df)
sns.boxplot(x= 'class',y = 'max.length_aspect_ratio',data = df)
sns.boxplot(x= 'class',y = 'scatter_ratio',data = df)
sns.boxplot(x= 'class',y = 'max.length_rectangularity',data = df)
sns.boxplot(x= 'class',y = 'scaled_radius_of_gyration',data = df)
sns.boxplot(x= 'class',y = 'scaled_radius_of_gyration.1',data = df)
sns.boxplot(x= 'class',y = 'skewness_about',data = df)
sns.boxplot(x= 'class',y = 'skewness_about.1',data = df)
sns.boxplot(x= 'class',y = 'skewness_about.2',data = df)
sns.boxplot(x= 'class',y = 'hollows_ratio',data = df)
Before applying PCA or any clustering method, the features should be standardized. You can use the zscore function to do this.
# Separate the features from the target (pop also removes 'class' from df)
interest_df = df.drop('class', axis=1)
target_df = df.pop('class')
from scipy.stats import zscore
interest_df_z = interest_df.apply(zscore)
interest_df_z.head()
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
X_std = sc.fit_transform(interest_df)
y = target_df.replace({'car':1,'bus':2,'van':3})
y.head()
X_std[:,:]
X_std.shape
covMatrix = np.cov(X_std, rowvar=False)
print(covMatrix)
from sklearn.decomposition import PCA
pca = PCA(n_components=18)
pca.fit(X_std)
Eigenvalues
print(pca.explained_variance_)
Eigenvectors
print(pca.components_)
print(pca.explained_variance_ratio_)
np.cumsum(pca.explained_variance_ratio_)
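The eigenvalues printed by `pca.explained_variance_` are exactly the eigenvalues of the covariance matrix computed earlier. A small hedged check of that relationship on synthetic data (variable names `X_demo`, `pca_demo` are illustrative, chosen to avoid clashing with the notebook's own `X_std` and `pca`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
X_demo = X_demo - X_demo.mean(axis=0)   # centre, as PCA does internally

pca_demo = PCA().fit(X_demo)

# Eigenvalues of the sample covariance matrix (ddof = n-1, matching sklearn).
cov = np.cov(X_demo, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # sort descending

print(np.allclose(eigvals, pca_demo.explained_variance_))  # True
```

This is why `np.cov(X_std, rowvar=False)` above and PCA tell the same story: PCA diagonalizes that covariance matrix.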
We can see from the principal component analysis that the first seven components alone cover more than 95% of the variance in the data. So we can get sufficiently accurate results using just the first seven components. A step plot depicting this is given below.
plt.step(list(range(1, 19)), np.cumsum(pca.explained_variance_ratio_), where='post')
plt.ylabel('Cumulative variance explained')
plt.xlabel('Number of components')
plt.show()
pca95 = PCA(n_components=7)
pca95.fit(X_std)
print(pca95.explained_variance_)
print(pca95.components_)
np.cumsum(pca95.explained_variance_ratio_)
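Instead of reading the 95% cutoff from the step plot by eye, sklearn's PCA also accepts a fraction for `n_components` and keeps the smallest number of components reaching that variance. A sketch on synthetic standardized data with deliberately redundant columns (the names `X_syn`, `pca_auto` are illustrative, to avoid clashing with the notebook's `pca95`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Four base columns plus four near-duplicates, so ~4 components dominate.
base = rng.normal(size=(300, 4))
X_syn = np.hstack([base, base + rng.normal(scale=0.1, size=(300, 4))])
X_syn = StandardScaler().fit_transform(X_syn)

pca_auto = PCA(n_components=0.95)   # smallest k covering 95% of variance
X_reduced = pca_auto.fit_transform(X_syn)
print(pca_auto.n_components_, X_reduced.shape)
```

On the vehicle data the same call would recover the seven-component choice made above, without a manual cutoff.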
# Splitting into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size = .30, random_state=0)
For the Support Vector Classifier (SVC) after Principal Component Analysis (PCA), we create a pipeline using 7 components for the PCA, and gamma=0.025 and C=3 as a trial.
target_names=['car','bus','van']
# Creating a pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn import svm
pipe_trial = Pipeline([ ('pca', PCA(n_components=7)), ('clf', svm.SVC(gamma=0.025 , C =3))])
pipe_trial.fit(X_train, y_train)
print('Train Accuracy: %.3f' % pipe_trial.score(X_train, y_train))
Now, for tuning the hyperparameters with cross validation, we create a pipeline pipe_svc and run Grid Search Cross Validation with 10 folds over a grid of 7 and 8 PCA components, SVC C values of 0.01, 0.05, 0.5 and 1, and both rbf and linear kernels.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report
pipe_svc = Pipeline([ ('pca', PCA()), ('svc', SVC())])
param_grid = {'pca__n_components':[7,8],'svc__C': [0.01, 0.05, 0.5, 1], 'svc__kernel':['rbf','linear']}
grid_svc = GridSearchCV( pipe_svc , param_grid = param_grid, cv = 10)
grid_svc.fit( X_train, y_train)
print(" Best cross-validation accuracy: {:.2f}". format( grid_svc.best_score_))
print(" Best parameters: ", grid_svc.best_params_)
y_pipe_svc_predict = grid_svc.predict(X_test)
print(" Test set accuracy: {:.2f}". format( grid_svc.score( X_test, y_test)))
print(classification_report(y_test, y_pipe_svc_predict, target_names=target_names))
The best cross validation accuracy obtained was 0.94, with 8 PCA components, C=1 and the Radial Basis Function kernel.
From the classification report, the prediction of the bus is the most accurate, as the bus is clearly distinct in size from the car and the van, which are more similar to each other. The recall of 1 for the bus indicates the model classifies every bus correctly.
Between the car and the van, the van is often misclassified as a car due to their similarity in size. The class imbalance, with car as the majority class, further hurts the recall of the van class, which at 0.83 is the lowest, while all other classes have a recall above 0.95.
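One cheap mitigation for the imbalance hurting the van's recall is `class_weight='balanced'` in SVC, which scales C inversely to class frequency. A hedged sketch on a synthetic two-class imbalanced problem (the 80/20 split and all parameters are illustrative, standing in for the car/van imbalance):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Synthetic imbalanced problem: class 1 is the ~20% minority.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

svc_plain = SVC(C=1, kernel='rbf').fit(X_tr, y_tr)
svc_bal = SVC(C=1, kernel='rbf', class_weight='balanced').fit(X_tr, y_tr)

# Minority-class recall, with and without reweighting.
print(recall_score(y_te, svc_plain.predict(X_te), pos_label=1))
print(recall_score(y_te, svc_bal.predict(X_te), pos_label=1))
```

Reweighting typically trades a little majority-class precision for better minority-class recall; `'svc__class_weight': ['balanced', None]` could simply be added to the grid above.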
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
# Instantiate the pipeline for PCA and Gaussian Naive Bayes
pipe_naive = Pipeline([ ('pca', PCA()), ('naive', GaussianNB())])
# Create the parameter grid based on the results of random search
param_grid = {
'pca__n_components':[7,8]
}
# Instantiate the grid search model
grid_naive = GridSearchCV(estimator = pipe_naive, param_grid = param_grid, cv = 10)
grid_naive.fit( X_train, y_train)
print(" Best cross-validation accuracy: {:.2f}". format( grid_naive.best_score_))
print(" Best parameters: ", grid_naive.best_params_)
y_pipe_naive_predict = grid_naive.predict(X_test)
print(" Test set accuracy: {:.2f}". format( grid_naive.score( X_test, y_test)))
print(classification_report(y_test, y_pipe_naive_predict, target_names=target_names))
The best cross validation accuracy obtained was 0.80 with 8 PCA components, and the test accuracy was 0.77, which is considerably lower than the Support Vector Classifier's accuracies of above 0.9 on both the test set and cross validation.
From the classification report, the prediction of the car is the most accurate, since the imbalance in the number of data points per class favours the majority class in the Naive Bayes algorithm, despite the bus being clearly distinct in size from the car and the van. Upsampling, downsampling or SMOTE from the imbalanced-learn library would probably lead to better results, since the imbalanced target classes have hurt the recall metric here as well.
As with the Support Vector Classifier, the van is often misclassified as a car due to their similarity in size.
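SMOTE from imbalanced-learn is one option; a dependency-free alternative is plain upsampling with `sklearn.utils.resample`, replicating minority-class rows until every class matches the majority count. A sketch on a toy frame (the counts and column names are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with an imbalanced 'class' column (counts illustrative).
demo = pd.DataFrame({
    'feature': range(8),
    'class': ['car', 'car', 'car', 'car', 'bus', 'bus', 'van', 'van'],
})

majority_n = demo['class'].value_counts().max()
# Resample each class group (with replacement) up to the majority count.
df_balanced = pd.concat([
    resample(grp, replace=True, n_samples=majority_n, random_state=0)
    for _, grp in demo.groupby('class')
], ignore_index=True)

print(df_balanced['class'].value_counts())  # every class now has 4 rows
```

Note that any resampling should be done on the training split only, inside the cross validation loop, to avoid leaking duplicated rows into the test folds.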
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Instantiate the pipeline for PCA and Random Forest Classifier
pipe_rf = Pipeline([ ('pca', PCA()), ('rf', RandomForestClassifier())])
# Create the parameter grid based on the results of random search
param_grid = {
'pca__n_components':[7],
'rf__bootstrap': [True],
'rf__max_depth': [3, 4, 5, 6],
'rf__max_features': [4, 5],
'rf__min_samples_leaf': [3, 4, 5],
'rf__min_samples_split': [8, 10, 12],
'rf__n_estimators': [10, 20, 30, 100]
}
# Instantiate the grid search model
grid_rf = GridSearchCV(estimator = pipe_rf, param_grid = param_grid, cv = 10)
grid_rf.fit( X_train, y_train)
print(" Best cross-validation accuracy: {:.2f}". format( grid_rf.best_score_))
print(" Best parameters: ", grid_rf.best_params_)
print(" Test set accuracy: {:.2f}". format( grid_rf.score( X_test, y_test)))
The random forest grid search took a long time to run and did not reach accuracies near those of the Support Vector Classifier, with a best cross validation accuracy of 0.86 at the given parameters and a test accuracy of 0.85. So the Random Forest is a less effective algorithm for this dataset than the SVC.